Abstract

While evolutionary algorithms are known to be very successful for a broad range of applications,
the algorithm designer is often left with many algorithmic choices, for example, the size of
the population, the mutation rates, and the crossover rates of the algorithm. These parameters
are known to have a crucial influence on the optimization time, and thus need to be chosen
carefully, a task that often requires substantial effort. Moreover, the optimal parameters can
change during the optimization process. It is therefore of great interest to design mechanisms
that dynamically choose best-possible parameters. An example of such an update mechanism
is the one-fifth success rule for step-size adaptation in evolution strategies. While in continuous
domains this principle is well understood also from a mathematical point of view, no comparable
theory is available for problems in discrete domains.

In this work we show that the one-fifth success rule can be effective also in discrete settings.
We regard the $(1 + (\lambda, \lambda))$ GA proposed in [Doerr/Doerr/Ebel: From black-box complexity to
designing new genetic algorithms, TCS 2015]. We prove that if its population size is chosen
according to the one-fifth success rule then the expected optimization time on OneMax is
linear. This is better than what any static population size $\lambda$ can achieve and is asymptotically
optimal also among all adaptive parameter choices.


1 Introduction 

It is widely acknowledged that setting the parameters of evolutionary algorithms (EA) is one of 
the key difficulties in evolutionary optimization. Eiben, Hinterding, and Michalewicz [13] call this 
challenge “one of the most important and promising areas of research in evolutionary computation”. 
This statement retains its topicality 15 years after the original publication of [13] as many talks at 
evolutionary computation conferences certify. We also understand today that even small changes
in the parameters can lead to exponential performance gaps of the regarded algorithms [10,11].

Substantial research efforts have been undertaken to find good parameter settings for general 
EAs, see for example [4]. Around the same time it has been discovered that it may be sub-optimal 
to use a fixed set of parameters throughout the whole optimization process. It was suggested instead 
to change the parameters of the algorithms by some dynamic update rules, often using some sort
of feedback of the fitness landscape that the algorithm is facing. For example, it can be beneficial
in earlier parts of the process to invest in exploration of the fitness landscape, while the algorithm 
should become more stable and focus on one or few areas of attraction in the later exploitation 
phase(s). 

Interestingly, while in continuous domains parameter control¹ is analyzed also theoretically [2,
15,17], adaptive parameter choices play only a marginal role in theoretical investigations of EAs
for discrete search spaces. The few exceptions showing an advantage of adaptive parameter settings
are the fitness-dependent mutation rates for the (1+1) evolutionary algorithm on LeadingOnes
analyzed in [3] and the fitness-dependent choice of the population size in [9] for the
$(1 + (\lambda, \lambda))$ GA on OneMax. Both these adaptation schemes, however, require a very solid
understanding of the problem at hand and are thus likely to be of limited practical relevance.
Another example from the discrete EA literature is the reduction of the parallel runtime (but not
the total optimization time) for several test functions when doubling the number of parallel
instances in a parallel EA after each unsuccessful iteration [19].

With this work we provide a first example for a discrete optimization problem where a
self-adjusting (i.e., adaptive, but not fitness-dependent) parameter choice yields an expected
optimization time that is better by more than a constant factor than any static parameter choice. The
parameter update rule is extremely simple and does not require any problem-specific insights.
More precisely, we analyze the runtime of the already mentioned $(1 + (\lambda, \lambda))$ GA with population
sizes chosen according to the one-fifth success rule on the generalized OneMax problem. We also
show that the one-fifth success update scheme is optimal in this setting, that is, no alternative
update mechanism can yield a significantly smaller runtime. In fact, we show that, throughout
the whole optimization process, the one-fifth success rule suggests parameter settings that closely
follow the theoretically best possible choices.

1.1 The One-Fifth Rule 

One of the earliest adaptive update rules suggested in the evolutionary computation literature is 
the one-fifth success rule. It was independently discovered in [5,21,22] and constitutes today one of 
the best known and most widely applied techniques in parameter control. Several empirical results 
(cf. [14] and references therein) suggest that EAs using the one-fifth rule for adaptive parameter 
control are quite capable of finding optimal or close to optimal parameter settings. Since the 
parameters are updated without the intervention of the user, such update mechanisms are a very 
convenient way to minimize parameter tuning efforts. Furthermore, the one-fifth success rule does 
not require any problem-specific knowledge and is thus widely applicable. 

Originally, the one-fifth rule was designed to control the step size of evolution strategies. In 
intuitive terms, it suggests that if the probability to create an offspring of better than current-best 
fitness is greater than 1/5, then the step size should be increased, while it should be decreased if the 
probability is lower than 1/5. Today, this rule has found applications much beyond the adaptation 
of the step size. Here in this work we use it for adjusting the offspring population size of a genetic 
algorithm. 

Without going into details, we note that many other adaptive update rules have been
experimented with in the evolutionary computation literature, cf. [13], [14, Chapter 8], and [18] for
excellent surveys.

¹ We use here and in the following a slightly adapted version of the terminology for parameter
setting suggested in [13]. This terminology is summarized in Section 3.1 and Figure 2.


1.2 The $(1 + (\lambda, \lambda))$ GA

We regard the $(1 + (\lambda, \lambda))$ GA, which has been proposed in [9] as a first example of an evolutionary
algorithm optimizing the generalized OneMax problem using $o(n \log n)$ function evaluations. For
the theory of evolutionary algorithms, this is a big success as it shows for the first time that even
for such simple problems the usage of crossover can be beneficial (all previously known evolutionary
algorithms need $\Omega(n \log n)$ function evaluations in expectation to optimize the generalized OneMax
problem).

An important parameter of the $(1 + (\lambda, \lambda))$ GA is $\lambda$, the number of offspring generated in the
mutation phase and the subsequent crossover phase of the algorithm. In the original paper [9] it is
shown that for $\lambda = \Theta(\sqrt{\log n})$ the expected optimization time of the $(1 + (\lambda, \lambda))$ GA on OneMax
is $O(n \sqrt{\log n})$. In [7] we improve the runtime bound to
$\Theta(\max\{n \log(n)/\lambda,\; n \lambda \log\log(\lambda)/\log(\lambda)\})$.
This expression is minimized for $\lambda = \Theta(\sqrt{\log(n) \log\log(n)/\log\log\log(n)})$, giving an
optimization time of $\Theta(n \sqrt{\log(n) \log\log\log(n)/\log\log(n)})$. While this exact expression is
irrelevant for the purposes of the present paper, it is important to note that this tight bound is
super-linear for every possible choice of $\lambda$.

It was also observed in [9] that the $(1 + (\lambda, \lambda))$ GA has only linear expected optimization time on
OneMax when the population size $\lambda$ is chosen adaptively depending on the fitness of the
current-best individual. While theoretically appealing, this result has limited practical implications
since the proposed fitness-dependent parameter choice crucially requires a very good understanding
of the optimization process and thus of the problem at hand. This is testified by the optimal
relation of the population size to the current fitness-distance $d$ to the optimum, which is
$\lambda = \lceil \sqrt{n/d} \rceil$. Guessing such a functional relationship for real-world optimization
problems is typically not doable with reasonable effort.

Interestingly, an alternative approach based on the one-fifth success rule was suggested in [9].
In a series of experimental evaluations it was shown that if the population size $\lambda$ is chosen according
to this rule, the performance of the resulting $(1 + (\lambda, \lambda))$ GA is among the best ones for a series of
test problems. In that algorithm the population size is increased if no improvement has happened
in the last iteration, while it is decreased otherwise. More precisely, for a suitable constant $F > 1$,
the population size parameter $\lambda$ is multiplied by $F^{1/4}$ after each iteration in which the fitness of the
current-best individual could not be improved, and it is divided by $F$ otherwise, i.e., if the fitness
of the current-best search point increased in that iteration.

1.3 Our Results 

While the observations made in [9] are purely empirical, we provide with this work a theoretical
analysis of the suggested self-adjusting $(1 + (\lambda, \lambda))$ GA. We prove that the suggested implementation
of the one-fifth success rule yields a linear expected optimization time of the $(1 + (\lambda, \lambda))$ GA on the
generalized OneMax problem. As noted above this is better than what any static parameter choice
can achieve and is also best possible among all comparison-based algorithms as we shall comment
in Section 4. In particular, our bound shows that the one-fifth success rule suggests population
sizes that are asymptotically optimal among all possible (static and dynamic) choices.

To the best of our knowledge, this is the first time that in a discrete search environment a
self-adjusting parameter choice is shown to be superior to any static choice. Indeed the
$(1 + (\lambda, \lambda))$ GA is the first proven example in discrete evolutionary algorithmics where a
non-fitness-dependent parameter choice reduces the optimization time by more than a constant factor.
The results that come closest to this are the mentioned results from [3,19], which are either constant
factor reductions of the expected runtime (in case of [3]) or reductions of the parallel expected
runtime but not the total number of function evaluations (in case of [19]).

Our proof gives some general insights into the working principles of adaptive parameter choices
in discrete domains, which will hopefully lead to future applications of this approach in discrete search.

Our paper is organized as follows. We first introduce the $(1 + (\lambda, \lambda))$ GA with static population
sizes, give background on the generalized OneMax problem, and recall known bounds for the
expected optimization time of the $(1 + (\lambda, \lambda))$ GA on OneMax functions in Section 2. In Section 3
we present the $(1 + (\lambda, \lambda))$ GA with self-adjusting parameter choices along with a brief summary of
the mentioned (slightly adapted) classification scheme of Eiben, Hinterding, and Michalewicz [13]
for parameter settings. We also present in Section 3 the runtime analysis of the self-adjusting
$(1 + (\lambda, \lambda))$ GA (see Section 3.3), followed by a discussion of the general insights obtained through
that analysis (Section 3.4). Finally, we show in Section 4 that the one-fifth success rule suggests
optimal or close to optimal parameter settings.

2 The $(1 + (\lambda, \lambda))$ GA with Static Population Size

Adopting the conventions and notation from [9], we regard in this work only search spaces
with bit string representations. We write $x = x_1 \ldots x_n$, $[k] := \{1, 2, \ldots, k\}$, and
$[0..k] := [k] \cup \{0\}$ for any bit string $x \in \{0,1\}^n$ and any non-negative integer $k$. By $B(n,p)$
we denote the binomial distribution with $n$ trials and success probability $p$; i.e.,
$B(n,p)(\ell) = \binom{n}{\ell} p^{\ell} (1-p)^{n-\ell}$ for any $\ell \in [0..n]$.

The $(1 + (\lambda, \lambda))$ GA is given in Algorithm 1. It uses the following two variation operators.

• The unary mutation operator $\mathrm{mut}_\ell(\cdot)$ which, given some $x \in \{0,1\}^n$, creates from $x$ a new
bit string $y$ by flipping exactly $\ell$ bit entries in it.

• The binary crossover operator $\mathrm{cross}_c(\cdot, \cdot)$ with crossover probability $c$, which, given two bit
strings $x$ and $x'$, chooses $y := \mathrm{cross}_c(x, x')$ by setting, for each $i \in [n]$, $y_i := x'_i$ with
probability $c$ and $y_i := x_i$ otherwise.

Thus, after a random initialization of the algorithm, in each iteration the following steps are
performed.

• In the mutation phase, $\lambda$ offspring are sampled from the current-best solution $x$ by applying
$\lambda$ times independently the mutation operator $\mathrm{mut}_\ell(\cdot)$ to $x$, where the step size $\ell$ is chosen at
random from $B(n,p)$ before the generation of the first offspring.

• In the crossover phase, $\lambda$ offspring are created from $x$ and (one of) the best of the $\lambda$ offspring
from the mutation phase, $x'$, by sampling independently from $\mathrm{cross}_c(x, x')$.

• Elitist selection: The best of the individuals of the crossover phase replaces $x$ if its fitness is
at least as large as the fitness of $x$. If there are several offspring with best fitness, we disregard
those that are equal to $x$ and choose one of the remaining ones uniformly at random.

Algorithm 1: The $(1 + (\lambda, \lambda))$ GA with offspring population size $\lambda$, mutation probability $p$,
and crossover probability $c$.

Initialization: Sample $x \in \{0,1\}^n$ uniformly at random and query $f(x)$;
Optimization: for $t = 1, 2, 3, \ldots$ do
    Mutation phase:
        Sample $\ell$ from $B(n,p)$;
        for $i = 1, \ldots, \lambda$ do
            Sample $x^{(i)} \leftarrow \mathrm{mut}_\ell(x)$ and query $f(x^{(i)})$;
        Choose $x' \in \{x^{(1)}, \ldots, x^{(\lambda)}\}$ with $f(x') = \max\{f(x^{(1)}), \ldots, f(x^{(\lambda)})\}$ u.a.r.;
    Crossover phase:
        for $i = 1, \ldots, \lambda$ do
            Sample $y^{(i)} \leftarrow \mathrm{cross}_c(x, x')$ and query $f(y^{(i)})$;
        If exists, choose $y \in \{y^{(1)}, \ldots, y^{(\lambda)}\} \setminus \{x\}$ with $f(y) = \max\{f(y^{(1)}), \ldots, f(y^{(\lambda)})\}$ u.a.r.;
        otherwise, set $y := x$;
    Selection step: if $f(y) \geq f(x)$ then $x \leftarrow y$;
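The steps of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in [9]: the function names (`mutate`, `crossover`, `iteration`, `run_static`) are hypothetical, the $\ell$ flipped positions are assumed to be distinct and chosen uniformly at random, bit strings are plain lists, and the parameters are fixed to $p = \lambda/n$ and $c = 1/\lambda$.

```python
import random

def onemax(x):
    """Classic OneMax: number of ones in the bit list x."""
    return sum(x)

def mutate(x, ell, rng):
    """mut_ell: flip exactly ell positions (assumed distinct, uniformly random)."""
    y = list(x)
    for i in rng.sample(range(len(x)), ell):
        y[i] ^= 1
    return y

def crossover(x, x_prime, c, rng):
    """cross_c: take each bit from x_prime with probability c, else from x."""
    return [b if rng.random() < c else a for a, b in zip(x, x_prime)]

def iteration(x, f, lam, rng):
    """One iteration of Algorithm 1 with p = lam/n and c = 1/lam."""
    n = len(x)
    p, c = lam / n, 1.0 / lam
    ell = sum(rng.random() < p for _ in range(n))  # ell ~ B(n, p)

    # Mutation phase: lam mutants, all created with the same step size ell.
    mutants = [mutate(x, ell, rng) for _ in range(lam)]
    m_fit = [f(m) for m in mutants]
    best = max(m_fit)
    x_prime = rng.choice([m for m, v in zip(mutants, m_fit) if v == best])

    # Crossover phase: lam offspring of x and x_prime; copies of x are disregarded.
    offspring = [crossover(x, x_prime, c, rng) for _ in range(lam)]
    o_fit = [f(y) for y in offspring]
    best = max(o_fit)
    winners = [y for y, v in zip(offspring, o_fit) if v == best and y != x]
    y = rng.choice(winners) if winners else x

    # Elitist selection (the re-evaluations here are not counted as queries).
    return y if f(y) >= f(x) else x

def run_static(n, lam, seed=0):
    """Run on OneMax until the optimum is found; return (x, fitness evaluations)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    evaluations = 1                       # initial query of f(x)
    while onemax(x) < n:
        x = iteration(x, onemax, lam, rng)
        evaluations += 2 * lam            # lam mutants + lam crossover offspring
    return x, evaluations
```

A call such as `run_static(50, 4, seed=1)` returns the all-ones string together with the number of fitness evaluations spent, counted as one initial query plus $2\lambda$ per iteration.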


Throughout this paper we shall use $p = \lambda/n$ and $c = 1/\lambda$, choices which are well justified in [9,
Sections 2 and 3]. Only in Section 4 do we regard arbitrary choices of the parameters $\lambda$, $p$, and $c$.

As performance measure we regard the expected running time of the $(1 + (\lambda, \lambda))$ GA, that is, the
expected number of function evaluations that the algorithm performs until it evaluates for the first
time an optimal search point $x \in \arg\max f$. This is the common measure in runtime analysis and
is sometimes referred to as the expected optimization time. Note that for algorithms performing
more than one fitness evaluation per iteration, such as the $(1 + (\lambda, \lambda))$ GA, the expected runtime
can be much different from the expected number of iterations (generations).

2.1 The Generalized OneMax Problem 

In [9, Section 4] the $(1 + (\lambda, \lambda))$ GA is analyzed by experimental means. The results show that
it performs well on OneMax functions, linear functions with random weights, and royal road
functions. A theoretical investigation, an improved bound of which is stated below in Theorem 1,
however, is currently available only for the generalized OneMax problem. As this problem is also
the focus of our present work, we give a short introduction here.

The classical OneMax function counts the number of ones in a bit string. Optimizing it 
therefore corresponds to finding the all-ones bit string. Of course, we want the performance of an 
evolutionary algorithm to be independent of the problem encoding. More specifically, the algorithms 
that we typically regard have exactly the same optimization behavior on any generalized OneMax 
function 


$\mathrm{OM}_z : \{0,1\}^n \to \mathbb{R};\; x \mapsto |\{ i \in \{1, \ldots, n\} \mid x_i = z_i \}|$,

which counts the number of positions in which the bit string $x$ agrees with the target string $z$.
It is easy to see that $z$ is the unique (global and local) optimum of $\mathrm{OM}_z$. Note also that the
classic OneMax function counting the number of ones in a bit string is the function $\mathrm{OM}_z$ with
$z = (1, \ldots, 1)$. It is therefore justified to call the set $\{\mathrm{OM}_z \mid z \in \{0,1\}^n\}$ the generalized
OneMax problem. For convenience we often drop the word “generalized” in the following. All
statements made below hold for arbitrary OneMax functions.
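As a small illustration of this definition, the following sketch builds $\mathrm{OM}_z$ for a given target string and checks that relabeling the bits turns it into the classic OneMax function. The helper name `onemax_z` and the list representation of bit strings are assumptions made for the example.

```python
def onemax_z(z):
    """Return OM_z, which counts the positions in which x agrees with z."""
    def om(x):
        if len(x) != len(z):
            raise ValueError("x and z must have the same length")
        return sum(xi == zi for xi, zi in zip(x, z))
    return om

z = [1, 0, 1, 1, 0]
om = onemax_z(z)
classic = onemax_z([1] * len(z))      # z = (1, ..., 1) gives classic OneMax

x = [1, 0, 1, 1, 1]
# Relabeling: bit i of the relabeled string is 1 exactly if x_i agrees with z_i,
# so OM_z(x) equals the classic OneMax value of the relabeled string.
relabeled = [1 - (xi ^ zi) for xi, zi in zip(x, z)]
```

Here `om(z)` equals $n = 5$, the unique maximum, and `om(x)` coincides with `classic(relabeled)`, which is the sense in which an encoding-independent algorithm behaves identically on every $\mathrm{OM}_z$.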

The expected runtime of most search heuristics on OneMax is $\Omega(n \log n)$, due to a phenomenon
known as the coupon collector’s problem (see, e.g., [6, Section 1.5] or [20]). In intuitive terms, the
argument is as follows. When the initial bit string is taken uniformly at random from $\{0,1\}^n$ then
each bit has a probability of 1/2 of being in the wrong initial configuration, i.e., it has to be touched
with probability 1/2. The coupon collector’s problem states that if we touch one random bit at
a time, then it takes $\Omega(n \log n)$ iterations until we have touched each bit at least once. Since
many evolutionary algorithms (including, for example, the (1+1) EA and Randomized Local Search)
change on average one bit per iteration, this implies the $\Omega(n \log n)$ bound. It was a long-standing
open question whether genetic algorithms can perform better on OneMax than this lower bound.
Sudholt [23] gave a first example of a crossover-based genetic algorithm outperforming the (1+1) EA
on OneMax. But while his (2+1) GA (for a suitably chosen mutation rate) is better by a constant
factor than the (1+1) EA, it does not improve upon RLS. The $(1 + (\lambda, \lambda))$ GA thus gave a first
positive answer to this question, as we recall in the next section.
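The coupon collector argument can be made concrete with harmonic numbers. The sketch below uses two standard facts that are not spelled out in the text: collecting all $n$ coupons takes $n H_n$ draws in expectation, and Randomized Local Search (flip one uniformly random bit, keep the result if it is not worse) needs $n H_d$ iterations in expectation when started at distance $d$ from the optimum. The function names are hypothetical.

```python
import math

def harmonic(k):
    """k-th harmonic number H_k = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / i for i in range(1, k + 1))

def expected_coupon_time(n):
    """Expected number of draws until each of n coupons was drawn at least once."""
    return n * harmonic(n)

def expected_rls_time(n, d):
    """Expected RLS iterations on OneMax from distance d: sum of n/k for k = 1..d."""
    return n * harmonic(d)

n = 1000
coupon = expected_coupon_time(n)       # about n * (ln n + 0.577)
rls = expected_rls_time(n, n // 2)     # typical random start: d is about n/2
```

For $n = 1000$ both quantities are of order $n \ln n \approx 6908$, which illustrates why one-bit-per-iteration heuristics cannot beat $\Omega(n \log n)$ on OneMax.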

2.2 Runtimes for Static and Fitness-Dependent Population Sizes 

The following statement, proven in [7], provides a tight bound for the expected runtime of the
$(1 + (\lambda, \lambda))$ GA on OneMax.

Theorem 1 ([7]). The expected optimization time of the $(1 + (\lambda, \lambda))$ GA with $p = \lambda/n$ and
$c = 1/\lambda$ on every generalized OneMax function is

$\Theta\left( \max\left\{ \frac{n \log(n)}{\lambda},\; \frac{n \lambda \log\log(\lambda)}{\log(\lambda)} \right\} \right)$.

Consequently, $\lambda = \Theta(\sqrt{\log(n) \log\log(n)/\log\log\log(n)})$ is the optimal choice for the parameter $\lambda$
and this yields an expected optimization time of $\Theta(n \sqrt{\log(n) \log\log\log(n)/\log\log(n)})$.

It is also known [9] that a fitness-dependent (and thus, inherently non-static) choice of the 
population size can decrease the runtime even further. 

Theorem 2 (Theorem 8 in [9]). The expected runtime of the $(1 + (\lambda, \lambda))$ GA with $p = \lambda/n$,
$c = 1/\lambda$, and fitness-dependent choice of $\lambda := \lceil \sqrt{n/(n - f(x))} \rceil$ on every generalized
OneMax function is linear in $n$.
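The fitness-dependent choice of Theorem 2 is easy to evaluate. The following sketch computes $\lceil \sqrt{n/(n - f(x))} \rceil$ with integer arithmetic to avoid floating-point rounding at perfect squares; the function name `lambda_star` is an assumption made for the example, and the value is left undefined at the optimum, where $n - f(x) = 0$.

```python
import math

def lambda_star(n, fx):
    """Fitness-dependent population size ceil(sqrt(n / (n - fx))) for 0 <= fx < n."""
    d = n - fx                          # fitness distance to the optimum
    if not 0 < d <= n:
        raise ValueError("requires 0 <= fx < n")
    m = -(-n // d)                      # ceil(n / d)
    # For integers k, k*k >= n/d holds exactly if k*k >= m, so the result is
    # ceil(sqrt(m)), which equals isqrt(m - 1) + 1 for every m >= 1.
    return math.isqrt(m - 1) + 1

values = [lambda_star(1000, fx) for fx in (0, 500, 750, 990, 999)]
```

For $n = 1000$ the population size stays at 1 or 2 for most of the run and only grows to 32 when a single bit is still wrong, which matches the shape of the fitness-dependent curve described for Figure 1.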

We will use the latter bound in our analysis of the self-adjusting $(1 + (\lambda, \lambda))$ GA. More precisely,
we show that the population sizes suggested by the one-fifth success rule typically do not deviate
much from the fitness-dependent choice analyzed in Theorem 2. This observation has also been
made experimentally in [9]. Figure 1, taken from [9] (Figure 5 in that paper), shows the close
relationship between the self-adjusting population sizes (in red) and the optimal fitness-dependent
ones (in black) for a typical run of the $(1 + (\lambda, \lambda))$ GA on OneMax.

Two milestones in the analysis of the $(1 + (\lambda, \lambda))$ GA in [9] are the success probabilities of the
mutation and the selection phase, respectively. Since we shall make use of these two bounds, we
briefly repeat them below.

Figure 1: Evolution of $f(x)$ and $\lambda$ in one representative run of the $(1 + (\lambda, \lambda))$ GA with
self-adjusting $\lambda$ and update strength $F = 1.5$ on OneMax with $n = 1000$. The bottom non-rugged
curve plots the fitness-dependent choice $\lambda^* = \sqrt{n/(n - f(x))}$ analyzed in [9] (cf. Theorem 2).

Note for Lemma 3 that for an offspring $x'$ of $x$ with $\mathrm{OM}_z$-value greater than $\mathrm{OM}_z(x) - \ell$ there
exists at least one position $i$ such that $x'_i = z_i$ while $x_i \neq z_i$. It is therefore possible to extract
this entry from $x'$ in the crossover phase, thus increasing the overall fitness of the current-best
search point. This is why we call the mutation phase successful if $\mathrm{OM}_z(x') > \mathrm{OM}_z(x) - \ell$ holds.

Lemma 3 (Lemma 5 in [9]). In the notation of Algorithm 1, for all $\ell$ and $x$, the probability that in
the mutation phase a search point $x'$ is created with $\mathrm{OM}_z(x') > \mathrm{OM}_z(x) - \ell$ is at least

$1 - \left( \frac{\mathrm{OM}_z(x)}{n} \right)^{\lambda \ell}$.

Lemma 4 (Lemma 6 in [9]). In the notation of Algorithm 1, consider fixed outcomes of $\ell$, $x$, and
$x'$. Then the random outcome $y$ of the crossover phase satisfies

$\Pr[\mathrm{OM}_z(y) > \mathrm{OM}_z(x) \mid \mathrm{OM}_z(x') > \mathrm{OM}_z(x) - \ell] \geq 1 - \left( 1 - c(1 - c)^{\ell - 1} \right)^{\lambda}$.
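To get a feeling for the two bounds, they can be evaluated numerically. The sketch below assumes the forms $1 - (\mathrm{OM}_z(x)/n)^{\lambda\ell}$ for the mutation phase and $1 - (1 - c(1-c)^{\ell-1})^{\lambda}$ for the crossover phase, with $\ell \geq 1$; the function names are hypothetical.

```python
def mutation_success_lb(n, om_x, lam, ell):
    """Lower bound on the probability that some mutant gains a correct bit."""
    return 1.0 - (om_x / n) ** (lam * ell)

def crossover_success_lb(c, lam, ell):
    """Lower bound on the probability that crossover yields a strict improvement,
    given a successful mutation phase (requires ell >= 1)."""
    return 1.0 - (1.0 - c * (1.0 - c) ** (ell - 1)) ** lam

# Example: one wrong bit left in a string of length n = 100, lam = ell = 10, c = 1/lam.
n, lam, ell = 100, 10, 10
mut_lb = mutation_success_lb(n, n - 1, lam, ell)
cross_lb = crossover_success_lb(1.0 / lam, lam, ell)
success_lb = mut_lb * cross_lb          # lower bound for the whole iteration
```

In this example the mutation bound is about 0.63 and the crossover bound about 0.33, so an iteration costing $2\lambda = 20$ evaluations succeeds with probability at least roughly 0.2, compared with a success probability of $1/n = 0.01$ per evaluation for a single random bit flip.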


3 The $(1 + (\lambda, \lambda))$ GA with Self-Adjusting Population Sizes

Being the first provable super-constant speed-up via a fitness-dependent parameter choice, the
linear optimization time obtained in [9] is a big success in the theory of evolutionary algorithms.
From the practical point of view, though, the question remains how in an actual application the
user of the $(1 + (\lambda, \lambda))$ GA would guess the fitness-dependent optimal choice of $\lambda$. In this section,
we show that this is not needed. A self-adjusting choice inspired by the classic one-fifth rule can
give the same (optimal, as the result in Section 4 shows) linear optimization time. To the best
of our knowledge, this is the first result proving a reduced optimization time via parameter
self-adjustment in discrete search spaces. We are optimistic that our approach can be applied to other
discrete problems. At the end of this section, we give some general hints that might be useful for
such purposes.

Figure 2: An extended version of the classification scheme from [13]. We regard in this work
self-adjusting parameter choices.

3.1 Terminology for Parameter Settings 

Since the literature is unanimous with respect to the terminology for parameter settings, we 
have adopted and slightly extended in this work the taxonomy of Eiben, Hinterding, and 
Michalewicz [13]. Figure 2, an adapted version of Figure 1 in [13], illustrates this classification, 
which we briefly summarize below. 

The effort of choosing the right parameters in an evolutionary algorithm is called parameter
setting. The first distinction is between static and dynamic parameter settings. In the former the
parameters are set before the actual run of the algorithm and they are not changed during the
optimization process. In typical applications, a parameter tuning step precedes the application of the
EA. In this phase, suitable parameter choices are sought through initial experimental investigations,
either for all parameters simultaneously, or in an iterative process.

Optimizing dynamic parameter choices is called parameter control in [13]. Three principles
are discussed: deterministic, adaptive, and self-adaptive parameter control. A dynamic parameter
choice is called deterministic if it does not depend on the fitness landscape encountered by the
algorithm. That is, there is no feedback between the fitness values and the dynamic parameters.²
Adaptive parameter choices are those dynamic rules where the update rule depends on the
optimization process. Within this class (and this is different from the classification scheme proposed
in [13]) we distinguish between functionally-dependent parameter choices (where the parameters
depend only on the current state of the algorithm, i.e., the current population) and self-adjusting
adaptive choices, where the parameters depend on the success of (all) previous iterations. It is easy
to see that the one-fifth success rule considered in this paper classifies as a self-adjusting parameter
setting, while the fitness-dependent parameter choice considered in Theorem 2 is an example of
a functionally-dependent parameter choice. Finally, self-adaptive parameter choices are encoded
themselves in the genome of the search points and are subject to variation operators. The hope is
that the better parameter choices yield better offspring and thus survive the evolutionary process.

² As noted in [13] this does not exclude randomized update schemes. A possibly better wording
would therefore be fixed or feedback-free update rules.


3.2 The Algorithm 

Recall that the one-fifth success rule in evolution strategies is used to change the step size in a
self-adjusting manner. When the empirical success probability is large, the step size is increased to
hopefully speed up the exploration. When it is low, it is reduced to hopefully increase the chance
of a success. This is done in a way that an average success probability of one fifth leads to no
change of the step size on average.

In discrete search spaces, naturally, things are very different. However, we can still come up with
a natural variant of the one-fifth success rule. Note that in our $(1 + (\lambda, \lambda))$ GA (with the suggested
choices $p = \lambda/n$ and $c = 1/\lambda$ for the mutation rate and the crossover rate, respectively), increasing
$\lambda$ will increase the success probability of one iteration, however, at the price of an increased number
of function evaluations, that is, higher runtime. Consequently, it makes sense to increase $\lambda$ when
the empirical success probability is low (to speed up the process of finding an improvement), but
to reduce it when the success probability is large (to hopefully save computational effort).

Taking Auger’s [1] implementation of the one-fifth success rule as example, we design the
following self-adjusting version of the $(1 + (\lambda, \lambda))$ GA, see also Algorithm 2. After an iteration that led
to an increase of the fitness of $x$ (“success”), indicating an easy success, we reduce $\lambda$ by a constant
factor $F > 1$ (of course, not letting $\lambda$ drop below 1). If an iteration was not successful, we increase
$\lambda$ by a factor of $F^{1/4}$ (since we analyze the algorithm for mutation probability $p = \lambda/n$ we do not
let $\lambda$ exceed $n$). Consequently, after a series of iterations with an average success rate of 1/5, we
end up with the initial value of $\lambda$ (unless the lower barrier of 1 was hit).

As a technical remark we note that where an integer is required in Algorithm 2 (e.g., lines 6
and 10) we round $\lambda$ to its closest integer, i.e., instead of $\lambda$ we regard $\lfloor \lambda \rfloor = \lambda - \{\lambda\}$ if the
fractional part $\{\lambda\}$ of $\lambda$ is less than 1/2 and we regard $\lceil \lambda \rceil := \lfloor \lambda \rfloor + 1$ otherwise.
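The update and rounding rules described above fit in two small functions. This is a minimal sketch assuming an update factor of $F^{1/4}$ after an unsuccessful iteration and a division by $F$ after a successful one, with $\lambda$ kept in $[1, n]$; the names `update_lambda` and `round_to_closest` are hypothetical.

```python
import math

def update_lambda(lam, success, F, n):
    """One-fifth success rule for the population size, kept within [1, n]."""
    if success:
        return max(lam / F, 1.0)          # success: shrink by the factor F
    return min(lam * F ** 0.25, float(n)) # failure: grow by the factor F^(1/4)

def round_to_closest(lam):
    """Round down if the fractional part is below 1/2, up otherwise."""
    floor = math.floor(lam)
    return floor if lam - floor < 0.5 else floor + 1

# Four failures followed by one success (success rate 1/5) leave lam unchanged.
lam = 10.0
for _ in range(4):
    lam = update_lambda(lam, False, 1.5, 1000)
lam_after = update_lambda(lam, True, 1.5, 1000)
```

Keeping $\lambda$ as a real number and rounding only where an integer is needed, as the text describes, avoids the situation in which a small $\lambda$ multiplied by $F^{1/4}$ is rounded back to its old value and never grows.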

3.3 Runtime Analysis 

We show that the self-adjusting $(1 + (\lambda, \lambda))$ GA (with standard parameters $p = \lambda/n$ and $c = 1/\lambda$)
solves the generalized OneMax problem in linear time when the self-adjusting speed factor $F$ is
not too large.

The proof of this result is rather technical. For this reason, we are only able to show a linear
optimization time when $F$ is smaller than a certain constant $F^*$, but we do not make this $F^*$
precise. In general, making implicit constants precise is a difficult task in runtime analysis, and for
many much simpler problems the implicit constants are not known. In the experiments conducted
in [8], see in particular Figure 4 there, all values $F \in [1,2]$ worked well (recall that in Auger’s
implementation [1] $F = 1.5$ was used). At the end of this section, we give some indication why
$F$-values larger than 2.25, however, may lead to an exponential expected optimization time.

Our main result is the following. 

Theorem 5. The optimization time of the self-adjusting $(1 + (\lambda, \lambda))$ GA with parameters $p = \lambda/n$
and $c = 1/\lambda$ on every generalized OneMax function is $O(n)$ for any sufficiently small update
strength $F > 1$.

Algorithm 2: The self-adjusting $(1 + (\lambda, \lambda))$ GA with mutation probability $p$, crossover
probability $c$, and update strength $F$.

Initialization: Sample $x \in \{0,1\}^n$ uniformly at random and query $f(x)$;
Initialize $\lambda \leftarrow 1$;
Optimization: for $t = 1, 2, 3, \ldots$ do
    Mutation phase:
        Sample $\ell$ from $B(n,p)$;
        for $i = 1, \ldots, \lambda$ do
            Sample $x^{(i)} \leftarrow \mathrm{mut}_\ell(x)$ and query $f(x^{(i)})$;
        Choose $x' \in \{x^{(1)}, \ldots, x^{(\lambda)}\}$ with $f(x') = \max\{f(x^{(1)}), \ldots, f(x^{(\lambda)})\}$ u.a.r.;
    Crossover phase:
        for $i = 1, \ldots, \lambda$ do
            Sample $y^{(i)} \leftarrow \mathrm{cross}_c(x, x')$ and query $f(y^{(i)})$;
        If exists, choose $y \in \{y^{(1)}, \ldots, y^{(\lambda)}\} \setminus \{x\}$ with $f(y) = \max\{f(y^{(1)}), \ldots, f(y^{(\lambda)})\}$ u.a.r.;
        otherwise, set $y := x$;
    Selection and update step:
        if $f(y) > f(x)$ then $x \leftarrow y$; $\lambda \leftarrow \max\{\lambda/F, 1\}$;
        if $f(y) = f(x)$ then $x \leftarrow y$; $\lambda \leftarrow \min\{\lambda F^{1/4}, n\}$;
        if $f(y) < f(x)$ then $\lambda \leftarrow \min\{\lambda F^{1/4}, n\}$;

To prove Theorem 5, roughly speaking, we show that the population sizes $\lambda$ suggested by the
one-fifth success rule are usually not very far from the fitness-dependent choice
$\lambda^* = \lceil \sqrt{n/(n - f(x))} \rceil$ analyzed in [9, Theorem 8] (which is restated above as Theorem 2).
Intuitively, if $\lambda$ happens to be much larger than $\lambda^*$, the success probability of the
$(1 + (\lambda, \lambda))$ GA is so large that with reasonably large probability one of the next iterations is
successful and, as a consequence, the value of $\lambda$ is then adjusted to its previous value divided by
$F$, thus approaching again $\lambda^*$. A key argument in this proof is the following lemma, which shows
that for large values of $\lambda$ the success probability of the $(1 + (\lambda, \lambda))$ GA is indeed reasonably
large. This lemma can be seen as a generalization of Lemma 7 in [9].

Lemma 6. Let $x \in \{0,1\}^n$. Let $\lambda \geq C_0 \lceil \sqrt{n/(n - f(x))} \rceil$. Let $q = q(\lambda)$ be the probability that one
iteration of Algorithm 1 (with parameters $p = \lambda/n$ and $c = 1/\lambda$) starting in $x$ is successful. There
exists a constant $C$ such that for all $C_0 \geq C$ we have $q \geq 1/5$.

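Proof of Lemma 6. We use the same notation as in the description of Algorithm 2. For readability
purposes we again write $\lambda$ even if an integer is required. Let $L$ denote the random value of $\ell$
sampled in the mutation phase. For any fixed $\varepsilon > 0$ and for any $\lambda$, the
success probability $q$ of increasing the fitness by at least one is (by the law of total probability) at
least

$\Pr[L \in [(1 - \varepsilon)\lambda, (1 + \varepsilon)\lambda]] \cdot \min\{ \Pr[f(y) > f(x) \mid L = \ell] \mid \ell \in [(1 - \varepsilon)\lambda, (1 + \varepsilon)\lambda] \}$.    (1)

By Lemmas 3 and 4 it holds for any $\ell$ that

$\Pr[f(y) > f(x) \mid L = \ell] \geq \left( 1 - \left( 1 - \tfrac{1}{\lambda} \left( 1 - \tfrac{1}{\lambda} \right)^{\ell - 1} \right)^{\lambda} \right) \left( 1 - \left( \tfrac{f(x)}{n} \right)^{\lambda \ell} \right)$.

It thus suffices to bound from below

(i) $\min\{ 1 - ( 1 - \tfrac{1}{\lambda} ( 1 - \tfrac{1}{\lambda} )^{\ell - 1} )^{\lambda} \mid \ell \in [(1 - \varepsilon)\lambda, (1 + \varepsilon)\lambda] \}$,

(ii) $\min\{ 1 - ( f(x)/n )^{\lambda \ell} \mid \ell \in [(1 - \varepsilon)\lambda, (1 + \varepsilon)\lambda] \}$,

(iii) $\Pr[L \in [(1 - \varepsilon)\lambda, (1 + \varepsilon)\lambda]]$.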

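Bounding (i): For any $\ell \in [(1 - \varepsilon)\lambda, (1 + \varepsilon)\lambda]$ it holds (using $(1 - 1/\lambda)^{\lambda} \geq 1/4$ for $\lambda \geq 2$) that

$\left( 1 - \tfrac{1}{\lambda} \left( 1 - \tfrac{1}{\lambda} \right)^{\ell - 1} \right)^{\lambda} \leq \left( 1 - \tfrac{1}{\lambda} (1/4)^{1 + \varepsilon} \right)^{\lambda} \leq \exp\left( -(1/4)^{1 + \varepsilon} \right)$.

For $\varepsilon < 1/25$ we can thus bound expression (i) from below by 0.21.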

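Bounding (ii): We set $d := n - f(x)$ and obtain, for $\ell \in [(1 - \varepsilon)\lambda, (1 + \varepsilon)\lambda]$,

$\left( \tfrac{f(x)}{n} \right)^{\lambda \ell} = \left( 1 - \tfrac{d}{n} \right)^{\lambda \ell} \leq \left( 1 - \tfrac{d}{n} \right)^{(1 - \varepsilon) \lambda^2} \leq \left( 1 - \tfrac{d}{n} \right)^{(1 - \varepsilon) C_0^2 n/d} \leq \exp\left( -(1 - \varepsilon) C_0^2 \right)$.

This expression is at most 1/100 for large enough $C_0$, showing that we can bound (ii) from below
by 0.99.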

Bounding (iii): We apply Chernoff’s bound to bound (iii) from below by $1 - 2\exp(-\varepsilon^2 \lambda/3)$.
Since $\lambda \geq C_0$, this term is again larger than 0.99 for a suitably chosen $C_0$.

Putting everything together we have seen that, for a suitable choice of $C_0$, the expression in (1)
is strictly larger than $0.99^2 \cdot 0.21 > 1/5$. □
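The numerical constants in this proof can be checked directly. The sketch below evaluates the bound $1 - \exp(-(1/4)^{1+\varepsilon})$ for expression (i) at $\varepsilon = 1/25$, the final product $0.99^2 \cdot 0.21$, and the smallest $\lambda$ for which the Chernoff term $1 - 2\exp(-\varepsilon^2\lambda/3)$ reaches 0.99. The function names are hypothetical, and the resulting threshold only illustrates the order of magnitude hidden in $C_0$, which the text does not compute.

```python
import math

def crossover_factor_lb(eps):
    """1 - exp(-(1/4)^(1 + eps)), the lower bound derived for expression (i)."""
    return 1.0 - math.exp(-(0.25 ** (1.0 + eps)))

def chernoff_lb(eps, lam):
    """1 - 2 exp(-eps^2 lam / 3), the lower bound used for expression (iii)."""
    return 1.0 - 2.0 * math.exp(-eps * eps * lam / 3.0)

eps = 1 / 25
bound_i = crossover_factor_lb(eps)       # slightly above 0.21
product = 0.99 ** 2 * 0.21               # lower bound on expression (1)

# Smallest integer lam with 2 exp(-eps^2 lam / 3) <= 1/100.
lam_min = math.ceil(3.0 * math.log(200.0) / eps ** 2)
```

With $\varepsilon = 1/25$ the Chernoff term alone requires $\lambda$ of roughly $10^4$, which suggests that the constant $C_0$ obtained from this argument is far from the values for which the rule works well in experiments.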

While the proof of Lemma 6 was rather straightforward, the proof of the main theorem, i.e.,
Theorem 5, is much more involved.

Proof of Theorem 5. As in the overview given before Lemma 6 we sloppily denote in the following by
$\lambda^*$ our fitness-dependent parameter choice of Theorem 2; i.e., $\lambda^* := \lceil \sqrt{n/(n - f(x))} \rceil$. Note that the
value of $\lambda^*$ depends on the current fitness value but that this is not reflected in the abbreviation.
To increase the readability, we again omit to specify whether $\lambda$ has to be rounded up or down.

We partition the optimization process into phases. The first phase starts with the first fitness
evaluation. A phase ends with an iteration at whose end we have increased the fitness of $x$ and
$\lambda < C_0 \lambda^*$ holds, for a sufficiently large constant $C_0$ that we do not compute explicitly. ($C_0$ is
determined by Lemma 6.)

We shall first show that each phase has an expected cost of $O(\lambda^*)$ fitness evaluations. From
this it is not difficult to conclude the proof by arguments used in the proof of Theorem 2.

To bound the expected cost of each phase, we distinguish between “short” phases, in which
$\lambda < C_0 \lambda^*$ holds throughout, and “long” phases, in which $\lambda \geq C_0 \lambda^*$ for at least one iteration. We
abbreviate the threshold $C_0 \lambda^*$ by $\bar{\lambda}$. Note that $\bar{\lambda}$ as well depends on the current fitness $f(x)$.

Claim 1: The expected cost of a short phase is $O(\lambda^*)$.

Proof of Claim 1: Let $\tilde\lambda$ be the value of $\lambda$ at the beginning of the phase and let $t$ be the number of iterations of the phase. Since $\lambda$ does not exceed the threshold $\bar\lambda = C_0\lambda^*$, there is exactly one iteration in which the fitness of $x$ is increased. That is, the value of $\lambda$ has first been multiplied $t - 1$ times by $F^{1/4}$ until there was a fitness increase, and the $\lambda$-value has been shrunk as a consequence of the fitness increase. The value of $\lambda$ at the end of the phase is thus $\tilde\lambda F^{(t-1)/4}/F$, and this value is bounded from above by $\bar\lambda$ (since we are considering a short phase). Since an iteration with parameter $\lambda$ requires $2\lambda$ fitness evaluations, the total number of fitness evaluations is thus

$$\sum_{i=0}^{t-1} 2\tilde\lambda F^{i/4} = 2\tilde\lambda\,\frac{F^{t/4} - 1}{F^{1/4} - 1} = O(\bar\lambda) = O(\lambda^*). \qquad (2)$$
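
The geometric sum in (2) is dominated by its last term. The following Python sketch illustrates this under stated assumptions: the function names, the value $F = 1.5$, and the sample numbers are ours and are not prescribed by the analysis. It compares the direct sum with the closed form and with the bound $2L\,F^{1/4}/(F^{1/4} - 1)$, where $L$ is the largest value of $\lambda$ used in the phase.

```python
def short_phase_cost(lam_start: float, t: int, F: float) -> float:
    """Fitness evaluations of a phase with t iterations.

    Iteration i (0-based) uses lambda = lam_start * F^(i/4) and costs 2*lambda.
    """
    return sum(2.0 * lam_start * F ** (i / 4) for i in range(t))

def short_phase_cost_closed_form(lam_start: float, t: int, F: float) -> float:
    """Closed form of the same geometric sum."""
    root = F ** 0.25
    return 2.0 * lam_start * (root ** t - 1.0) / (root - 1.0)

# Illustrative values only.
F, lam_start, t = 1.5, 2.0, 10
largest = lam_start * F ** ((t - 1) / 4)      # lambda in the last iteration
root = F ** 0.25
print(short_phase_cost(lam_start, t, F))
print(short_phase_cost_closed_form(lam_start, t, F))
print(2.0 * largest * root / (root - 1.0))    # constant-factor bound in terms of the largest lambda
```

Since the largest value is at most the threshold in a short phase, the last line is a constant multiple of the threshold, with the constant depending only on $F$.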

Claim 2: The expected cost of a long phase is $O(\lambda^*)$.

Proof of Claim 2: We split the long phase into an opening phase and a main phase. The opening phase ends with the last iteration in which $\lambda < \bar\lambda$ holds, where $\bar\lambda = C_0\lambda^*$ is the threshold, so that the main phase starts with a $\lambda$ that is at least as large as $\bar\lambda$, but less than $\bar\lambda F^{1/4}$.

Let $T$ denote the number of fitness evaluations during the phase and let $I$ denote the number of iterations in the main (!) phase. As in the proof of Claim 1 it will be easy to see that $E[T \mid I = t] \le D\lambda^* F^{t/4}$ for all $t$ and some fixed constant $D$. The most technical part of this proof is to bound the probability that the main phase of a long phase requires $t$ iterations; i.e., $\Pr[I = t]$ given that we are in a long phase. We show that this probability is at most $\exp(-ct)$ for some positive constant $c$. It is well known that the geometric series $\sum_{t \ge 0} F^{t/4}\exp(-ct)$ converges if $F^{1/4} < \exp(c)$.

The overall expected number of fitness evaluations during a long phase is thus

$$\sum_{t=1}^{\infty} E[T \mid I = t]\,\Pr[I = t] \le D\lambda^* \sum_{t=1}^{\infty} F^{t/4}\exp(-ct),$$

which is $O(\lambda^*)$ as desired.
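
The convergence condition $F^{1/4} < \exp(c)$ can be made concrete. The following Python sketch is an illustration only; the function name and the sample values of $F$ and $c$ are our assumptions. It evaluates the series $\sum_{t \ge 1} (F^{1/4} e^{-c})^t$ in closed form and refuses parameter pairs for which the series diverges.

```python
import math

def long_phase_series(F: float, c: float) -> float:
    """Value of sum_{t>=1} F^(t/4) * exp(-c*t), a geometric series with ratio r."""
    r = F ** 0.25 * math.exp(-c)
    if r >= 1.0:
        raise ValueError("series diverges: need F^(1/4) < exp(c)")
    return r / (1.0 - r)

# Illustrative values only: ratio is below 1, so the series is finite.
print(long_phase_series(1.5, 0.2))
```

The closed form shows how the hidden constant in the $O(\lambda^*)$ bound grows without limit as $F^{1/4}$ approaches $\exp(c)$.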

It remains to prove the following two claims. 

Claim 2.1: $E[T \mid I = t] \le D\lambda^* F^{t/4}$ for some large enough constant $D$.

Claim 2.2: Given that we are in a long phase, we have $\Pr[I = t] \le \exp(-ct)$ for a positive constant $c$.

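Proof of Claim 2.1: The cost of the opening phase is at most $2\tilde\lambda\sum_{i=0}^{m} F^{i/4}$, where $\tilde\lambda$ is the initial value of $\lambda$ at the beginning of the phase and $m$ is chosen maximally such that $\tilde\lambda F^{m/4} < \bar\lambda$, with $\bar\lambda = C_0\lambda^*$ the threshold. As in (2) one shows that this sum is $O(\lambda^*)$. Similarly, since the initial $\lambda$ of the main phase is at most $\bar\lambda F^{1/4}$, the cost of the main phase is at most $2\sum_{i=1}^{t} \bar\lambda F^{i/4}$, where

$$\sum_{i=1}^{t} \bar\lambda F^{i/4} \le \bar\lambda\,\frac{F^{(t+1)/4} - 1}{F^{1/4} - 1} \le D'\bar\lambda F^{(t+1)/4}$$

for $D' > 1/(F^{1/4} - 1)$.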

Proof of Claim 2.2: As mentioned above, the main phase starts with a $\lambda$ that is at least $\bar\lambda$ and strictly less than $\bar\lambda F^{1/4}$, where $\bar\lambda = C_0\lambda^*$ is the threshold. We are interested in the first point in time at which $\lambda$ is less than $\bar\lambda$. Note that all future values encountered in this phase are of the type $\lambda F^{r}$ for $r$ being a multiple of $1/4$, where $\lambda$ here denotes the value at the beginning of the main phase. By regarding this exponent $r$, we transform the process into a biased random walk on the line $(1/4)\mathbb{Z}$. Our starting position is 0. If an iteration is successful, i.e., if the fitness value of $x$ has increased during the iteration, the process does one step of length one to the left. It does a step of length $1/4$ to the right otherwise (we thus pessimistically ignore the fact that $\lambda$ never exceeds $n$). We bound the probability that it takes $t$ or more iterations until this random walk has reached a value of less than 0. At this point in time the current $\lambda$ is less than the original $\bar\lambda$ value that was active at the beginning of the main phase (a position below 0 is at most $-1/4$, so $\lambda$ has shrunk by a factor of at least $F^{1/4}$ from a start value below $\bar\lambda F^{1/4}$). That is, when the random walk reaches a position smaller than 0, $\lambda$ is for certain less than the then active threshold $\bar\lambda$ (which, by definition, increases whenever the fitness value of $x$ does).

If an iteration is successful with probability at least $q$, the expected progress of this random walk in one iteration is $(1 - q)/4 - q$, which is negative since by Lemma 6 we have $q > 1/5$. Hence there exists a constant $c > 0$ such that $(1 - q)/4 - q \le -c$.

To conclude the proof of Claim 2.2, let us define random variables $X_i$, $1 \le i \le t$, by setting $X_i = 1/4$ if the fitness does not increase in the $i$th iteration of the main phase, and setting $X_i = -1$ otherwise. We have just seen that $E[X_i] \le -c$. Given that we are in a long phase, the probability that the main phase has length at least $t$ equals $\Pr[\forall j < t: \sum_{i=1}^{j} X_i \ge 0]$. This is at most $\Pr[\sum_{i=1}^{t-1} X_i \ge 0]$, which is in turn bounded from above by $\Pr[\sum_{i=1}^{t-1} X_i \ge E[\sum_{i=1}^{t-1} X_i] + (t-1)c]$.

We apply Chernoff's bound (confer Theorem 1.11 in [6] for a version that allows for random variables that do not necessarily take positive values) to see that, as desired, this term is at most

$$\exp\Big(-\frac{2(t-1)^2 c^2}{(t-1)(5/4)^2}\Big) = \exp\big(-32(t-1)c^2/25\big).$$

□ 
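
The random-walk argument can be illustrated by an exact computation. The following Python sketch rests on a simplifying assumption that is ours: every iteration succeeds independently with probability exactly $q$, whereas the proof only needs a success probability of at least $q$. Positions are stored in units of $1/4$, so a failure adds $+1$ and a success adds $-4$. The function names are hypothetical.

```python
import math

def main_phase_tail(q: float, t_max: int) -> list:
    """tail[s] = probability that the walk is still >= 0 after each of its first s steps."""
    dist = {0: 1.0}                 # probability mass on non-negative positions
    tail = [1.0]
    for _ in range(t_max):
        nxt = {}
        for pos, prob in dist.items():
            up = pos + 1            # failure: lambda is multiplied by F^(1/4)
            nxt[up] = nxt.get(up, 0.0) + prob * (1.0 - q)
            down = pos - 4          # success: lambda is divided by F
            if down >= 0:           # mass that drops below 0 leaves the main phase
                nxt[down] = nxt.get(down, 0.0) + prob * q
        dist = nxt
        tail.append(sum(dist.values()))
    return tail

def hoeffding_bound(q: float, steps: int) -> float:
    """Bound exp(-32 * steps * c^2 / 25) with c = q - (1 - q)/4."""
    c = q - (1.0 - q) / 4.0
    return math.exp(-32.0 * steps * c * c / 25.0)

tail = main_phase_tail(0.25, 40)
for s in (1, 10, 20, 40):
    print(s, tail[s], hoeffding_bound(0.25, s))
```

For a success probability only slightly above 1/5 the constant $c$ is small, so the bound decays slowly; the exact tail is typically much smaller than the bound suggests.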


3.4 General Insights from the Runtime Analysis 

The analysis above reveals the following facts, which might be helpful in general when trying to 
use a one-fifth success rule or a related self-adjusting rule in discrete search spaces. 

The adjustment rule must fit to the limiting success probability. In the proof above, it was crucial that the success probability shown in Lemma 6 was a constant larger than 1/5. It is easy to see that if the success probability was uniformly bounded from above by a constant $a < 1/5$, then $\log_F(\lambda)$ in expectation would increase by a positive constant in each round. Consequently, $\lambda$ would show an exponential growth, quickly leading to wastefully large values. This can partially be overcome by imposing an upper barrier for $\lambda$ (we have such a barrier, namely $\lambda \le n$, to ensure that the mutation probability $p = \lambda/n$ is at most 1); however, this would still lead to the algorithm mostly working with this maximal value of $\lambda$ instead of a value close to the ideal $\lambda^*$.

In general, there is no reason for not trying success rules with other ratios $1/r$ than one-fifth, that is, increasing $\lambda$ by $F^{1/(r-1)}$ instead of $F^{1/4}$ in case of non-success. In general, a larger value of $r$ will slightly decrease the speed of adjustment, but is more likely to overcome the problem described in the previous paragraph. Note that for our problem, when $\lambda = \omega(1)$, the success probability is uniformly bounded from above by $(1 + o(1))(1 - e^{-1/e}) \approx 0.31$. Consequently, the one-fifth rule avoids the exponential growth of $\lambda$, whereas a one-third rule or a one-half rule (e.g., doubling or halving the parameter as in [19]) would not.
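
The effect of the ratio $1/r$ can be read off from the expected change of $\log_F(\lambda)$ per iteration. The following Python sketch is an illustration; the function name is ours, and the success probability 0.31 is used as if it were attained exactly, which is an assumption made only for this example.

```python
def expected_log_change(q: float, r: float) -> float:
    """Expected change of log_F(lambda) in one iteration of a 1/r success rule.

    Success (probability q): lambda is divided by F, a change of -1.
    Failure: lambda is multiplied by F^(1/(r-1)), a change of +1/(r-1).
    """
    return (1.0 - q) / (r - 1.0) - q

q = 0.31  # assumed success probability, taken from the upper bound in the text
for r in (5, 3, 2):
    print(r, expected_log_change(q, r))
```

The expected change is zero exactly at $q = 1/r$, negative for larger $q$, and positive for smaller $q$. With a success probability of at most 0.31, the one-third and one-half rules therefore always have a positive expected change, while the one-fifth rule has a negative one whenever the success probability exceeds 1/5.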

The constant F matters. Even when the combination of update rule and success probability avoids an expected exponential growth of $\lambda$, things can still go wrong when the update strength $F$ is too large. Here is an example (where, to ease the presentation, we assume that we have no upper barrier on $\lambda$; with an upper barrier, as above, the problem remains, though possibly to a smaller extent): Imagine that we start with some value $\lambda_0$. Above, we saw that the success probability of an iteration is bounded from above by 0.31. Consequently, the probability of having exactly $m$ consecutive non-successes is at least $0.31 \cdot 0.69^m$. The optimization time of the last iteration alone is $2\lambda_0 F^{m/4}$. Consequently, the expected effort of finding one improvement is at least $0.31\,\lambda_0 \sum_{m \ge 0} (0.69 F^{1/4})^m$. When $0.69 F^{1/4} \ge 1$, this series does not converge, i.e., the expected effort for one improvement is infinite. In our case, this happens (at least) when $F > 4.41$. Note that this was a rough estimate aimed at quickly demonstrating that large $F$-values can be dangerous.
Better values can be achieved with more effort. E.g., the probability that among $6m$ iterations, we have at most $m$ successes, is more than $(0.755 + o(1))^m$; this can be seen from approximating the binomial distribution with a normal distribution. Since this event also increases the initial $\lambda$ value by $F^{m/4}$, the corresponding series already diverges for $F > 3.08$. Optimizing the ratio of successes and non-successes, we see that the probability of having $\gamma m$ successes among $(1 + 5\gamma)m$ trials, for a suitable constant $\gamma$, is more than $(0.8167 + o(1))^m$, showing that any $F > 2.25$ leads to an exponential expected optimization time. We do not know to what extent this argument can be improved. For this reason, we would rather suggest choosing a small value of $F$, clearly below 2, and trading the possibly faster adjustment to the ideal parameter value for a reduced risk of an expected infinite optimization time.
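
All three thresholds for $F$ arise in the same way: an event of probability about $\rho^m$ multiplies $\lambda$ by $F^{m/4}$, so the corresponding series diverges once $\rho F^{1/4} \ge 1$, that is, once $F \ge \rho^{-4}$. The following Python sketch recomputes the thresholds from the three values of $\rho$ quoted above; the function name is ours.

```python
def divergence_threshold(rho: float) -> float:
    """Smallest F with rho * F^(1/4) >= 1, i.e. F = rho^(-4)."""
    return rho ** -4

for rho in (0.69, 0.755, 0.8167):
    print(rho, round(divergence_threshold(rho), 2))
```

Rounded to two decimals, the three values agree with the thresholds 4.41, 3.08, and 2.25 stated in the text.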

4 A Linear Lower Bound for All Possible Parameter Choices 

In the previous section we have seen that the self-adjusting $(1 + (\lambda, \lambda))$ GA is faster on average than with any static population size. We next show that it is asymptotically best possible also among all dynamic parameter choices. That is, regardless of how the parameters are updated in each iteration, the $(1 + (\lambda, \lambda))$ GA always has an expected runtime on OneMax that is at least linear in $n$.

Theorem 7. For every (possibly dynamic) choice of the mutation probability $p$, the crossover probability $c$, and the population size $\lambda$, the $(1 + (\lambda, \lambda))$ GA performs at least linearly many function evaluations on average before it evaluates for the first time the unique global optimum. That is, regardless of the parameter update scheme, the expected runtime of the $(1 + (\lambda, \lambda))$ GA on OneMax is $\Omega(n)$.

For parameter choices that do not depend on absolute fitness values, Theorem 7 follows from an elegant technique in black-box complexity. Since we believe this intuitive argument to be of general interest to the Self-* research community, we present it in the following section. It can be seen as a nice, yet powerful tool for analyzing the limitations of evolutionary and other black-box optimization algorithms.

For fitness-dependent parameter choices, Theorem 7 also holds, but needs a different argument as we shall comment in Section 4.2.

4.1 Lower Bound for Self-Adjusting Parameter Choices 

We start the exposition of the lower bound by introducing the concept of comparison-based algorithms.

Definition 8. A comparison-based black-box algorithm does not make use of absolute fitness values. 
Instead, it bases all decisions solely on the comparison of search points. 

To see that the self-adjusting $(1 + (\lambda, \lambda))$ GA is a comparison-based algorithm, we reformulate the algorithm slightly, see Algorithm 3. This alternative presentation shows that indeed all decisions of the $(1 + (\lambda, \lambda))$ GA are entirely based on comparisons between at most two search points. A lower bound that holds for all comparison-based algorithms thus immediately implies a lower bound for the $(1 + (\lambda, \lambda))$ GA. We therefore expand our view and regard the whole class of comparison-based algorithms. Theorem 7 for non-fitness-dependent parameter choices follows from the following statement, which is folklore knowledge in black-box complexity and has been formally stated (in much more general form) in [24, Corollary 2].


Algorithm 3: A reformulation of the $(1 + (\lambda, \lambda))$ GA with possibly adaptive (but not fitness-dependent) parameters $p$, $c$, and $\lambda$.

Initialization: Sample $x \in \{0,1\}^n$ uniformly at random;
Optimization: for $t = 1, 2, 3, \ldots$ do
    Depending on $x$ (but not $f(x)$) choose $\lambda \in [n]$, $p \in [0,1]$, and $c \in [0,1]$;
    Mutation phase:
        Sample $\ell$ from $B(n,p)$ and $x' := x^{(1)} \leftarrow \mathrm{mut}_\ell(x)$;
        for $i = 2, \ldots, \lambda$ do
            Sample $x^{(i)} \leftarrow \mathrm{mut}_\ell(x)$;
            if $f(x^{(i)}) > f(x')$ then $x' \leftarrow x^{(i)}$;
    Crossover phase:
        Initialize $y := y^{(1)} \leftarrow \mathrm{cross}_c(x, x')$;
        for $i = 2, \ldots, \lambda$ do
            Sample $y^{(i)} \leftarrow \mathrm{cross}_c(x, x')$;
            if ($y^{(i)} \ne x$ and $f(y^{(i)}) > f(y)$) then $y \leftarrow y^{(i)}$;
    Selection and update step:
        if $f(y) > f(x)$ then $x \leftarrow y$;
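
Algorithm 3 leaves the choice of the parameters open. As an illustration, the following Python sketch instantiates it on OneMax with the self-adjusting rule analyzed in Section 3: $\lambda$ is divided by $F$ after a successful iteration and multiplied by $F^{1/4}$ otherwise, with $p = \lambda/n$ and $c = 1/\lambda$. It is a minimal sketch under assumptions of ours, not the authors' implementation: the value $F = 1.5$, the rounding of $\lambda$ to the nearest integer, the evaluation budget, and all function names are hypothetical choices.

```python
import random

def onemax(x):
    return sum(x)

def mutate(x, ell, rng):
    """mut_ell: flip exactly ell distinct, uniformly chosen positions."""
    y = list(x)
    for i in rng.sample(range(len(x)), ell):
        y[i] = 1 - y[i]
    return y

def crossover(x, x_prime, c, rng):
    """cross_c: take each bit from x_prime with probability c, else from x."""
    return [b if rng.random() < c else a for a, b in zip(x, x_prime)]

def self_adjusting_ga(n, F=1.5, seed=0, max_evals=10**6):
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx, evals, lam = onemax(x), 1, 1.0
    while fx < n and evals < max_evals:
        k = max(1, round(lam))               # offspring per phase (rounding is our choice)
        p, c = k / n, 1.0 / k
        # Mutation phase: all k mutants flip the same number ell ~ B(n, p) of bits.
        ell = sum(rng.random() < p for _ in range(n))
        x_prime = max((mutate(x, ell, rng) for _ in range(k)), key=onemax)
        # Crossover phase: keep the best of k crossover offspring.
        y = max((crossover(x, x_prime, c, rng) for _ in range(k)), key=onemax)
        fy = onemax(y)
        evals += 2 * k
        if fy > fx:                          # success: accept y and shrink lambda
            x, fx = y, fy
            lam = max(lam / F, 1.0)
        else:                                # failure: grow lambda, capped at n
            lam = min(lam * F ** 0.25, float(n))
    return x, evals

best, used = self_adjusting_ga(40, seed=1)
print(onemax(best), used)
```

The sketch reads fitness values only to compare two search points, so it stays within the comparison-based model of Definition 8. For simplicity it counts $2\lambda$ evaluations per iteration and does not skip crossover offspring that equal $x$.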


Theorem 9. Every comparison-based algorithm needs at least $\Omega(n)$ comparisons on average to optimize a generalized OneMax function.

The intuitive argument for Theorem 9 is pretty simple. In order to optimize a OneMax function $\mathrm{OM}_z$ we need to identify $z$. That is, we need to learn the $n$ bits of $z$. Roughly speaking, with each query we learn at most one bit of information about $z$, namely in the mutation phase we learn whether $\mathrm{OM}_z(x^{(i)}) > \mathrm{OM}_z(x')$ or not, and in the crossover phase we learn the bit whether or not $\mathrm{OM}_z(y^{(i)}) > \mathrm{OM}_z(y)$. Finally, we learn the one bit of information whether or not $\mathrm{OM}_z(y) > \mathrm{OM}_z(x)$. Thus one iteration with $\lambda$ offspring in the mutation and the crossover phase each gives us a total of $2(\lambda - 1) + 1$ bits of information. Thus, amortized over the $2\lambda$ offspring that were created, this is a bit less than one bit of information per query. Since we need to learn all $n$ bits of $z$, this shows (intuitively) that we need to sample and compare at least $n$ search points in total. This implies the lower bound.
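
The counting in this argument is easy to reproduce. The following Python sketch is an illustration of the informal argument only, not of the formal proof; the function name is ours. It computes the number of comparison outcomes per iteration divided by the number of offspring created.

```python
def bits_per_query(lam: int) -> float:
    """Comparison outcomes per created offspring in one iteration.

    Each phase contributes lam - 1 comparisons, plus one final comparison of y with x.
    """
    comparisons = 2 * (lam - 1) + 1
    offspring = 2 * lam
    return comparisons / offspring

for lam in (1, 2, 10, 100):
    print(lam, bits_per_query(lam))
```

The ratio equals $1 - 1/(2\lambda)$ and therefore stays strictly below one for every population size.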

It is not too difficult to make this intuitive argument formal. To this end, one employs Yao’s 
Minimax Principle [25], a powerful tool in black-box complexity. The interested reader can find 
a quite accessible exposition of Yao’s Principle along with some easy to follow examples from 
evolutionary computation in [12]. 

4.2 Lower Bound for Fitness-Dependent Parameter Choices

The proof given in Section 4.1 does not work for parameter choices that possibly depend on the absolute fitness of intermediate search points. Intuitively, the problem here is that $\mathrm{OM}_z(x) \in [0..n]$ and knowing (or, rather, using knowledge about) the absolute fitness value of $x$ provides $\log_2(n+1)$ bits of information. This would only yield a lower bound of order $n/\log n$. Still the statement, i.e., Theorem 7, holds also for fitness-dependent parameter choices, as we shall briefly comment in this section. Note that this bound also implies the optimality of the fitness-dependent choice suggested in [9].
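
The gap between the two information-theoretic bounds can be made concrete with a small computation. The following Python sketch is an illustration; the function names and the sample value $n = 1023$ are ours.

```python
import math

def fitness_value_bits(n: int) -> float:
    """Bits needed to encode one fitness value from [0..n]."""
    return math.log2(n + 1)

def information_lower_bound(n: int, bits_per_query: float) -> float:
    """Queries needed to collect n bits if each query reveals at most bits_per_query bits."""
    return n / bits_per_query

n = 1023
print(information_lower_bound(n, 1.0))                    # comparison-based setting
print(information_lower_bound(n, fitness_value_bits(n)))  # absolute fitness values available
```

With absolute fitness values the counting argument loses a factor of $\log_2(n+1)$, which is why a different argument is needed to recover the linear bound.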

In the analysis of [7] it is shown (for the recommended parameter choices $p = \lambda/n$ and $c = 1/\lambda$) that the expected fitness gain in one iteration of the $(1 + (\lambda, \lambda))$ GA with population size $\lambda$ is of order at most $\log\lambda/\log\log\lambda$ (see the proof of the upper and lower bounds of Theorem 2 in [7]). That is, for “investing” $2\lambda$ function evaluations we obtain a fitness gain of order at most $\log\lambda/\log\log\lambda$. This shows that the average fitness gain per function evaluation cannot be more than constant, thus implying the linear lower bound. To make this formal, one can use the additive drift theorem [16].

To show Theorem 7 in full generality, one has to redo the analysis from [7] for general $p$ and $c$, a tedious but straightforward task that we omit here.

5 Conclusions 

We have analyzed the $(1 + (\lambda, \lambda))$ GA with self-adjusting population sizes. We have shown that it optimizes any generalized OneMax function in linear time. This is best possible for any (static or dynamic) parameter choice and is better by a $\Theta(\sqrt{\log(n)\log\log\log(n)/\log\log(n)})$ factor than any $(1 + (\lambda, \lambda))$ GA with static population size. Our result thus shows for the first time that self-adjusting parameter choices can be provably beneficial in discrete optimization problems.

We hope that our work inspires more work on the running time of evolutionary algorithms with self-adjusting parameter choices. We have provided some general insights that should be taken into account when implementing a one-fifth success rule in discrete search spaces.

While our work focuses on an adjustment of the population size, we are confident that also for 
other parameters of evolutionary algorithms, e.g., the mutation or crossover rates, self-adjusting 
choices can be analyzed theoretically. 